Abstract

Deterministic execution offers many benefits for debug- 
ging, fault tolerance, and security. Running parallel pro- 
grams deterministically is usually difficult and costly, 
however — especially if we desire system-enforced de- 
terminism, ensuring precise repeatability of arbitrarily 
buggy or malicious software. Determinator is a novel 
operating system that enforces determinism on both mul- 
tithreaded and multi-process computations. Determi- 
nator's kernel provides only single-threaded, "shared- 
nothing" address spaces interacting via deterministic 
synchronization. An untrusted user-level runtime uses 
distributed computing techniques to emulate familiar ab- 
stractions such as Unix processes, file systems, and 
shared memory multithreading. The system runs parallel 
applications deterministically both on multicore PCs and 
across nodes in a cluster. Coarse-grained parallel benchmarks
perform and scale comparably to, and sometimes better than,
conventional systems, though determinism is
costly for fine-grained parallel applications. 

1 Introduction 

It is often useful to run software deterministically, ensur- 
ing a given program and input always yields exactly the 
same result. Deterministic execution makes bugs reproducible,
and is required for "record-and-replay" debugging [28, 40].
Fault tolerance [15, 18, 49] and accountability mechanisms [33]
rely on execution being deterministic and bit-for-bit identical
across state replicas. Intrusion analysis [23, 36] and timing
channel control [4] can
further benefit from system-enforced determinism, where 
the system prevents application code from depending on 
execution timing or other unintended inputs even if the 
code is maliciously designed to do so. 

Multicore processors and ubiquitous parallelism make 
programming environments increasingly nondeterminis- 
tic, however. Nondeterminism makes software harder to 
develop and debug [43, 44]. Race detectors help [27, 45],
but even properly synchronized programs may have
higher-level heisenbugs [3]. The cost of logging and replaying
the internal nondeterministic events in parallel
software [20,^4] can be orders of magnitude higher than 
that of logging only a computation's external inputs, especially
for system-enforced replay [23, 24]. This cost
usually precludes logging "normal-case" execution, diminishing
the technique's effectiveness. A heisenbug
or intrusion that manifests "in the field" with logging
disabled may not reappear during subsequent logged
attempts to reproduce it, especially with malware designed
to evade analysis by detecting the timing impact
of logging or virtualization [30].

Motivated by its many uses, we would like system- 
enforced determinism to be available for normal-case ex- 
ecution of parallel applications. To test this goal's feasi- 
bility, we built Determinator, an operating system that 
not only executes individual processes deterministically, 
as in deterministic user-level scheduling [8, 9], but can
enforce determinism on hierarchies of interacting pro- 
cesses. Rerunning a multi-process Determinator compu- 
tation with the same inputs yields exactly the same out- 
puts, without internal event logging. Determinator treats 
all potential nondeterministic inputs to a computation — 
including all timing information — as "privileged infor- 
mation," which normal applications cannot obtain except 
via controlled channels. We treat deterministic execu- 
tion as not just a debugging tool but a security principle: 
if malware infects an unprivileged Determinator applica- 
tion, it should be unable to evade replay-based analysis. 

System-enforced determinism is challenging because 
current programming environments and APIs are riddled 
with timing dependencies. Most shared-memory parallel 
code uses mutual exclusion primitives; even when used 
correctly, timing determines the application-visible order 
in which competing threads acquire a mutex. Concur- 
rency makes names allocated from shared namespaces, 
such as pointers returned by malloc() and file descriptors
returned by open(), timing-dependent. Synchronizing
operations like semaphores, message queues, and
wait() nondeterministically return "the first" event,
message, or terminated process available. Even single- 
threaded processes are nondeterministic when run in 
parallel, due to their interleaved accesses to shared resources.
A parallel 'make -j' command often presents
a chaotic mix of its child tasks' outputs, for example, and 
missing dependencies can yield "makefile heisenbugs" 
that manifest only under parallel execution. 

Addressing these challenges in Determinator led us to 
the insight that timing dependencies commonly fall into 
a few categories: unintended interactions via shared state 
or namespaces; synchronization abstractions with shareable
endpoints; true dependencies on "real-world" time;
and application-level scheduling. Determinator avoids 
physically shared state by isolating concurrent activities 
during normal execution, allowing interaction only at ex- 
plicit synchronization points. The kernel's API uses lo- 
cal, application-chosen names in place of shared, OS- 
managed namespaces. Synchronization primitives op- 
erate "one-to-one," between specific threads, preventing 
threads from "racing" to an operation. Determinator 
treats access to real-world time as I/O, controlling it as 
with other devices such as disk or network. Finally, De- 
terminator requires scheduling to be separated from ap- 
plication logic and handled by the system, or else emu- 
lated using a deterministic, virtual notion of "time." 

Since we wish to derive basic principles for system- 
enforced determinism, Determinator currently makes no 
attempt at compatibility with existing operating systems, 
and provides limited compatibility with existing APIs. 
The kernel's low-level API offers only one user-visible 
abstraction, spaces, representing execution state and vir- 
tual memory, and only three system calls by which 
spaces synchronize and communicate. The API's min- 
imality facilitates both experimentation and reasoning 
about its determinism. Despite this simplicity, our untrusted,
user-level runtime builds atop the kernel to provide
familiar programming abstractions. The runtime
uses file replication and versioning pTl to offer appli- 
cations a logically shared file system via standard APIs; 
distributed shared memory M2I17II to create multithreaded 
processes logically sharing an address space; and deterministic
scheduling [8, 9, 22] to support pthreads-style
synchronization. Since the kernel enforces determinism, 
bugs or vulnerabilities in this runtime cannot compro- 
mise the determinism guarantee. 

Experiments with common parallel benchmarks sug- 
gest that Determinator can run coarse-grained paral- 
lel applications deterministically with both performance 
and scalability comparable to nondeterministic environ- 
ments. Determinism incurs a high cost on fine-grained 
parallel applications, however, due to Determinator's use 
of virtual memory to isolate threads. For "embarrass- 
ingly parallel" applications requiring little inter-thread 
communication, Determinator can distribute the com- 
putation across nodes in a cluster mostly transparently 
to the application, maintaining usable performance and 
scalability. The current prototype is merely a proof- 
of-concept and has many limitations, such as a restric- 
tive space hierarchy, limited file system size, no per- 
sistent storage, and inefficient cross-node communica- 
tion. Also, our "clean-slate" approach is motivated by 
research goals; a more realistic approach to deploying 
system-enforced determinism would be to add a deterministic
"sandbox" [19, 32] to a conventional OS.

This paper makes three main contributions. First, we
identify five OS design principles for system-enforced
determinism, and illustrate their application in a novel 
kernel API. Second, we demonstrate ways to build famil- 
iar abstractions such as file systems and shared memory 
atop a kernel API restricted to deterministic primitives. 
Third, we present the first system that can enforce deter- 
ministic execution on multi-process computations with 
performance acceptable for "normal-case" use, at least 
for some (coarse-grained) parallel applications. 

Section 2 describes Determinator's kernel design principles
and API, then Section 3 details its user-space application
runtime. Section 4 examines our prototype implementation,
and Section 5 evaluates it informally and
experimentally. Finally, Section 6 outlines related work,
and Section 7 concludes.

2 The Determinator Kernel 

This section describes Determinator's underlying design 
principles, then its low-level execution model and kernel 
API. We do not expect normal applications to use the ker- 
nel API directly, but rather the higher-level abstractions 
the user-level runtime provides, as described in the next 
section. We make no claim that this API is the "right" de- 
sign for a determinism-enforcing kernel, but merely use 
it to explore design challenges and strategies. 

2.1 Kernel API Design Principles 

We first briefly outline the principles we developed 
in designing Determinator, which address the common 
sources of timing dependencies we are aware of. We 
further discuss the motivations and implications of these 
principles below as we detail the kernel API. We make 
no claim that this is a complete or conclusive list, but at 
least for Determinator these principles prove sufficient to 
offer a deterministic execution guarantee, for which we 
briefly sketch formal arguments later in Section 2.4.

1. Isolate the working state of concurrent activities 
between synchronization points. Determinator's ker- 
nel API directly provides no shared state abstractions, 
such as global file systems or writeable shared memory. 
Concurrent activities operate within private "sandboxes," 
interacting only at deterministically defined synchroniza- 
tion points, eliminating timing dependencies due to inter- 
leaved access to shared state. 

2. Use local, application-chosen names instead of 
global, system-allocated names. APIs that assign 
names from a shared namespace introduce nondeterminism
even when the named objects are unshared: execution
timing affects the pointers returned by malloc()
or mmap() or the file numbers returned by open()
in multithreaded Unix processes, and the process IDs
returned by fork() or the file names returned by
mktemp() in single-threaded processes. To avoid these
sources of nondeterminism, Determinator's kernel API
uses only local names chosen by the application: user-level
code decides where to allocate memory and what
process IDs to assign children. This principle ensures
that naming a resource reveals no shared state information
other than what the application itself provided.

Figure 1: The kernel's hierarchy of spaces, each containing
private register and virtual memory state.

3. User code determines the participants in any syn- 
chronization operation, and the point in each par- 
ticipant's execution at which synchronization occurs. 

The kernel API allows a thread or process to synchronize
with a particular target, as Unix processes use
waitpid() to wait for a specific child. The API does
not support synchronizing with "any" or "the first available"
target as in Unix's wait(), or interrupting another
thread at a timing-dependent point in its execution,
as with Unix signals. Nondeterministic synchronization 
APIs may be emulated deterministically, if needed for 
compatibility, as described in Section [33] 

4. Treat access to explicit time sources as I/O. User
code has no direct access to clocks counting either real
time, as in gettimeofday(), or nondeterministic
"virtual time" measures, as in getrusage(). Determinator
treats such timing sources as I/O devices that
user code may access only via controlled channels, as 
with other devices such as network, disk, and display. 

5. Separate application logic from scheduling. Deterministic
applications cannot make timing-dependent
internal scheduling or load-balancing decisions, as today's
applications often do using thread pools or work
queues. Applications may expose arbitrary parallelism
and provide scheduling hints (in principle they could
even download extensions into the kernel to customize
scheduling [10]), provided the kernel prevents custom
scheduling policies from affecting computed results.

2.2 Spaces 

Determinator executes application code within a hierarchy
of spaces, illustrated in Figure 1. Each space consists
of CPU register state for a single control flow, and private
virtual memory containing code and data directly accessible
within that space. A Determinator space is analogous
to a single-threaded Unix process, with several important
differences; we use the term "space" to highlight
these differences and avoid confusion with the "process"
and "thread" abstractions Determinator emulates at user
level, as described later in Section 3.

Table 2: Options/arguments to the Put and Get calls.

As in a nested process model [29], a Determinator
space cannot outlive its parent, and a space can directly 
interact only with its immediate parent and children via 
three system calls described below. Following princi- 
ple 1 above, the kernel provides no file systems, writable 
shared memory, or other shared state abstractions. 

Following principle 4, only the distinguished root 
space has direct access to nondeterministic I/O devices 
including clocks; other spaces can access I/O devices 
only indirectly via parent/child interactions, or via I/O 
privileges delegated by the root space. A parent space 
can thus control all nondeterministic inputs into any un- 
privileged space subtree, e.g., logging inputs for future 
replay. (This space hierarchy also creates a performance 
bottleneck for I/O-bound applications, a limitation of the 
current design we intend to address in future work.) 

2.3 System Call API 

Determinator spaces interact only as a result of processor
traps and the kernel's three system calls (Put, Get,
and Ret), summarized in Table 1. Put and Get take several
optional arguments, summarized in Table 2. Most
options can be combined: e.g., in one Put call a space 
can initialize a child's registers, copy a range of the par- 
ent's virtual memory into the child, set page permissions 
on the destination range, save a complete snapshot of the 
child's address space, and start the child executing. 

As per principle 2 above, each space has a private 
namespace of child spaces, which user-level code man- 
ages. A space specifies a child number to Get or Put, and 
the kernel creates that child if it doesn't already exist, be- 
fore performing the requested operations. If the specified 
child did exist and was still executing at the time of the 
Put/Get call, the kernel blocks the parent's execution un- 
til the child stops due to a Ret system call or a processor 
trap. These "rendezvous" semantics ensure that spaces
synchronize only at well-defined points in both spaces'
execution, as required by principle 3.

Call  Description
Put   Copy register state and/or a virtual memory range into a child space, and optionally start the child executing.
Get   Copy register state, a virtual memory range, and/or changes since the last snapshot out of a child space.
Ret   Stop and wait for parent to issue a Get or Put.

Table 1: System calls comprising Determinator's kernel API.
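The rendezvous semantics described above can be illustrated with a small user-level simulation. This is only a sketch under stated assumptions, not the kernel's interface: the Space class, its put and get methods, and the use of Python generators to stand for a child's control flow are all hypothetical. The model is also sequential (a child runs only when the parent rendezvouses with it), which is observably equivalent here because the child's state is private between synchronization points.

```python
# Toy model of Put/Get/Ret rendezvous (hypothetical names, not the real kernel API).
# A child's control flow is a generator; each `yield` plays the role of a Ret.

class Space:
    def __init__(self):
        self.children = {}   # private, application-chosen child numbers
        self.memory = {}     # this space's private "virtual memory"
        self._flow = None    # suspended control flow, if the space was started

    def _child(self, num):
        # The child is created on first use of its number.
        return self.children.setdefault(num, Space())

    def put(self, num, memory=None, start=None):
        """Copy state into child `num` and optionally start it."""
        child = self._child(num)
        if memory is not None:
            child.memory.update(memory)
        if start is not None:
            child._flow = start(child)

    def get(self, num):
        """Rendezvous: let child `num` reach its next Ret, then copy its memory out."""
        child = self._child(num)
        if child._flow is not None:
            try:
                next(child._flow)          # child executes up to its next Ret
            except StopIteration:
                child._flow = None         # child's code ran to completion
        return dict(child.memory)


def worker(space):
    space.memory["sum"] = sum(space.memory["data"])
    yield                                  # Ret: stop and wait for the parent
    space.memory["done"] = True
    yield


root = Space()
root.put(1, memory={"data": [1, 2, 3]}, start=worker)
first = root.get(1)     # state as of the child's first Ret
second = root.get(1)    # state as of the child's second Ret
```

The parent never observes the child between two Ret points, so what it reads depends only on the child's code and inputs, not on how fast the child happens to run.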

The Copy option logically copies a range of virtual 
memory between the invoking space and the specified 
child. The kernel uses copy-on-write to optimize large 
copies and avoid physically copying read-only pages. 

Merge is available only on Get calls. A Merge is like a 
Copy, except the kernel copies only bytes that differ between
the child's current and reference snapshots into the
parent space, leaving other bytes in the parent untouched. 
The kernel also detects conflicts: if a byte changed in 
both the child's and parent's spaces since the snapshot, 
the kernel generates an exception, treating a conflict as 
a programming error like an illegal memory access or 
divide-by-zero. Determinator's user-level runtime uses 
Merge to give multithreaded processes the illusion of 
shared memory, as described later in Section 3.4. In principle,
user-level code could implement Merge itself, but
the kernel's direct access to page tables makes it easy for 
the kernel to implement Merge efficiently. 
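The Merge rule can be written down as a byte-wise three-way comparison. The following is a minimal sketch of those semantics only: the function name, the MergeConflict exception, and the flat byte arrays are assumptions for illustration, and the real kernel works on page tables rather than scanning every byte. Checking for conflicts before modifying the parent is a choice made in this sketch, not something the text specifies.

```python
class MergeConflict(Exception):
    """Raised when parent and child both changed the same byte since the snapshot."""


def merge(parent: bytearray, child: bytes, snapshot: bytes) -> None:
    """Copy into `parent` only the bytes of `child` that differ from `snapshot`."""
    if not (len(parent) == len(child) == len(snapshot)):
        raise ValueError("all three images must cover the same range")
    # First pass: detect conflicts, so a failed merge leaves the parent untouched.
    for i, (p, c, s) in enumerate(zip(parent, child, snapshot)):
        if c != s and p != s:
            raise MergeConflict(f"byte {i} changed in both parent and child")
    # Second pass: apply the child's changes.
    for i, (c, s) in enumerate(zip(child, snapshot)):
        if c != s:
            parent[i] = c


snapshot = bytes([0, 0, 0, 0])
parent = bytearray(snapshot)
child_a = bytes([7, 0, 0, 0])   # one "thread" wrote byte 0
child_b = bytes([0, 0, 9, 0])   # another wrote byte 2
merge(parent, child_a, snapshot)
merge(parent, child_b, snapshot)
```

Because each child's changes are computed against the same reference snapshot, the order in which the parent merges non-conflicting children does not affect the final contents.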

Finally, the Ret system call stops the calling space, re- 
turning control to the space's parent. Exceptions such as 
divide-by-zero also cause a Ret, providing the parent a 
status code indicating why the child stopped. 

To facilitate debugging and prevent untrusted children 
from looping forever, a parent can start a child with an 
instruction limit, forcing control back to the parent af- 
ter the child and its descendants collectively execute this 
many instructions. Counting instructions instead of "real 
time" preserves determinism, while enabling spaces to 
"quantize" a child's execution to implement scheduling 
schemes deterministically at user level [8, 22].
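To illustrate why instruction limits preserve determinism, the sketch below quantizes several children in fixed round-robin order. It is an illustration under assumptions: run_quantized is a hypothetical helper, each item yielded by an iterator stands for one executed "instruction", and nothing here corresponds to an actual Determinator interface.

```python
from itertools import islice


def run_quantized(children, quantum):
    """Run children round-robin, at most `quantum` steps each, in child-number order.

    `children` maps child numbers to iterators that yield one item per "instruction".
    Returns the global trace of (child number, item) pairs.
    """
    trace = []
    live = dict(sorted(children.items()))
    while live:
        for num in list(live):
            steps = list(islice(live[num], quantum))
            trace.extend((num, s) for s in steps)
            if len(steps) < quantum:   # child finished before exhausting its limit
                del live[num]
    return trace


def make_children():
    return {2: iter("xyz"), 1: iter("abcde")}


trace1 = run_quantized(make_children(), quantum=2)
trace2 = run_quantized(make_children(), quantum=2)
```

The interleaving is a function of the children's instruction streams and the quantum alone, so repeating the run reproduces the same trace regardless of how long each step takes in real time.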

2.4 Reasoning about Determinism 

Can we be certain the kernel API above indeed guaran- 
tees that space subtrees execute deterministically despite 
parallelism? While a detailed proof is out of scope, we 
briefly sketch two formal arguments for this guarantee. 

The first argument leverages an existing formal parallel
computing model: a Kahn process network [38] is a
network of single-threaded processes, which run sequen- 
tial code deterministically and interact only via blocking, 
one-to-one message channels. Under these restrictions, 
a Kahn network behaves deterministically. Determina- 
tor's Get, Put, and Ret calls are implementable in terms 
of messages on one-to-one channels, making Determina- 
tor's space hierarchy formally equivalent to a Kahn pro- 
cess network, thereby ensuring its determinism. 
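The Kahn-network restriction can be demonstrated with ordinary threads. In the sketch below (an illustration, not part of Determinator), each queue.Queue serves as a one-to-one channel, and the consumer always blocks on a specific channel rather than on whichever has data first. The None end-of-stream marker is a convention of this example.

```python
import queue
import threading


def producer(values, out):
    for v in values:
        out.put(v)        # send on a one-to-one channel
    out.put(None)         # end-of-stream marker (a convention of this sketch)


def consumer(chan_a, chan_b, result):
    # Always block on a *specific* channel, never on "whichever is ready first".
    while True:
        a = chan_a.get()
        b = chan_b.get()
        if a is None or b is None:
            break
        result.append(a + b)


def run_network():
    chan_a, chan_b, result = queue.Queue(), queue.Queue(), []
    threads = [
        threading.Thread(target=producer, args=([1, 2, 3], chan_a)),
        threading.Thread(target=producer, args=([10, 20, 30], chan_b)),
        threading.Thread(target=consumer, args=(chan_a, chan_b, result)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return result


runs = [run_network() for _ in range(20)]
```

Although the operating system may schedule the three threads in any order, every run produces the same result, because no process can observe which channel became ready first.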

Figure 2: A space migrating between two nodes and starting
child spaces on each node.

For a more "first-principles" argument, consider a
graph of possible execution traces of a space hierarchy.
Each node represents a synchronization point in a possi- 
ble execution history of one space, vertical edges repre- 
sent local computation sequences in one space between 
synchronization points, and horizontal edges represent 
pairwise interactions where a parent space's Get or Put 
synchronizes with a child's Ret. From this graph we 
construct a "happens-before" partial order over all syn- 
chronization points in all possible executions. At each 
synchronization point, assuming all prior (on the partial 
order) computation sequences and synchronization inter- 
actions yield only one possible result for a given set of 
inputs, then the same is true after that synchronization 
point: each synchronization point in a parent space inter- 
acts with only one corresponding point in a specific child, 
and vice versa, and synchronization effects such as mem- 
ory changes depend only on the two spaces' states prior 
to synchronization. By induction on the partial order, the 
entire execution history is therefore deterministic. 

2.5 Distribution via Space Migration 

The kernel allows space hierarchies to span not only 
multiple CPUs in a multiprocessor/multicore system, but 
also multiple nodes in a cluster, mostly transparently 
to application code. While distribution is semantically 
transparent to applications, we say "mostly transpar- 
ently" because an application may have to be designed 
with distribution in mind to achieve acceptable perfor- 
mance. As with other aspects of the kernel's design, 
we make no pretense that this is the "right" approach to 
cross-node distribution, but merely one way to extend a 
deterministic execution model across a cluster.

Distribution support adds no new system calls or op- 
tions to the API above. Instead, the Determinator kernel 
interprets the higher-order bits in each process's child
number namespace as a "node number" field. When
a space invokes Put or Get, the kernel first logically 
migrates the calling space's state and control flow to 
the node whose number the user specifies as part of its 
child number argument, before creating and/or interact- 
ing with a child on that node specified in the remaining 
child number bits. Figure |2]illustrates a space migrating 
between two nodes and managing child spaces on each. 

Once created, a space has a home node, to which the 
space migrates when interacting with its parent on a Ret 
or trap. Nodes are numbered so that "node zero" in 
any space's child namespace always refers to the space's 
home node. If a space uses only the low bits in its 
child numbers and leaves the node number field zero, the 
space's children all have the same home as the parent. 
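A child number that carries a node number in its high-order bits might be packed and unpacked as follows. The field widths are assumptions chosen for illustration (the text above does not give them), and both helper names are hypothetical.

```python
NODE_BITS = 8     # assumed width of the node-number field (not from the paper)
CHILD_BITS = 24   # assumed width of the per-node child index (not from the paper)


def make_child_number(node, index):
    """Pack a node number and a per-node child index into one child number."""
    if not 0 <= node < (1 << NODE_BITS):
        raise ValueError("node number out of range")
    if not 0 <= index < (1 << CHILD_BITS):
        raise ValueError("child index out of range")
    return (node << CHILD_BITS) | index


def split_child_number(child):
    """Return (node, index); node 0 means the calling space's home node."""
    return child >> CHILD_BITS, child & ((1 << CHILD_BITS) - 1)


remote = make_child_number(node=3, index=5)   # child 5 on node 3
local = 7                                     # node field left zero
```

A space that uses only small child numbers never sets the node field, so all of its children stay on its home node without any change to the application.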

When the kernel migrates a space, it first transfers to 
the receiving kernel only the space's register state and 
address space summary information. Next, the receiving 
kernel requests the space's memory pages on demand as 
the space accesses them on the new node. Each node's 
kernel avoids redundant cross-node page copying in the 
common case when a space repeatedly migrates among 
several nodes — e.g., when a space starts children on each 
of several nodes, then returns later to collect their results. 
For pages that the migrating space only reads and never 
writes, such as program code, each kernel reuses cached 
copies of these pages whenever the space returns to that 
node. The kernel currently performs no prefetching or 
other adaptive optimizations. Its rudimentary messaging 
protocol runs directly atop Ethernet, and does not support 
TCP/IP for Internet-wide distribution. 

3 Emulating High-Level Abstractions 

The kernel API described above eliminates many conve- 
niences to which developers and users are accustomed. 
Can we reproduce them under the constraint of strict 
determinism? We find that many familiar abstractions 
remain feasible, although some semantically nondeter- 
ministic abstractions may be costly to emulate precisely. 
This section details the user-level runtime infrastructure 
we developed to emulate traditional Unix processes, file 
systems, threads, and synchronization under Determinator.

3.1 Processes and fork/exec/wait 

We make no attempt to replicate Unix process se- 
mantics exactly, but would like to emulate traditional 
fork/exec/wait APIs enough to support common 
uses in scriptable shells, build tools, and multi-process 
"batch processing" applications such as compilers. 

Fork: Implementing a basic Unix fork() requires
only one Put system call, to copy the parent's entire
memory state into a child space, set up the child's registers,
and start the child. The difficulty arises from
Unix's global process ID (PID) namespace, a source of
nondeterminism violating our design principle 2 (Section 2.1).
Since most applications use PIDs returned by
fork() merely as an opaque argument to a subsequent
waitpid(), our runtime makes PIDs local to each process:
one process's PIDs are unrelated to, and may numerically
conflict with, PIDs in other processes. This
change breaks Unix applications that pass PIDs among
processes, and means that commands like 'ps' must be
built into shells for the same reason that 'cd' already is.
This simple approach works for compute-oriented applications
following the typical fork/wait pattern, however.
Since fork() returns a PID chosen by the system,
while our kernel API requires user code to manage child
numbers, our user-level runtime maintains a "free list" of
child spaces and reserves one during each fork(). To
emulate Unix process semantics more closely, a central
space such as the root space could manage a global PID
namespace, at the cost of requiring inter-space communication
during operations such as fork().
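The process-local PID scheme can be sketched as a small piece of per-process bookkeeping. Everything below is hypothetical: the ProcessRuntime class, the lowest-free-number allocation policy, and the max_children limit are assumptions, and the Put and Get calls a real runtime would issue are indicated only in comments.

```python
class ProcessRuntime:
    """Per-process child bookkeeping (hypothetical user-level runtime state)."""

    def __init__(self, max_children=8):
        # Free list of child numbers; these double as process-local PIDs.
        self.free = list(range(1, max_children + 1))
        self.live = set()

    def fork(self):
        if not self.free:
            raise OSError("no free child spaces")
        pid = self.free.pop(0)   # depends only on this process's own call history
        self.live.add(pid)
        # A real runtime would now issue one Put to copy memory into child
        # space `pid`, set up its registers, and start it.
        return pid

    def waitpid(self, pid):
        if pid not in self.live:
            raise ChildProcessError(f"no child with PID {pid}")
        # A real runtime would issue a Get here to rendezvous with the child's Ret.
        self.live.remove(pid)
        self.free.append(pid)
        self.free.sort()         # reuse then depends only on which numbers are free
        return pid


shell, make = ProcessRuntime(), ProcessRuntime()
shell_pids = [shell.fork(), shell.fork()]
make_pids = [make.fork()]        # numerically collides with the shell's first PID
shell.waitpid(shell_pids[0])
reused = shell.fork()
```

Since no state is shared between the two runtimes, the PIDs each process sees cannot reveal anything about the timing of forks in other processes.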

Exec: A user-level implementation of Unix exec()
must construct the new program's memory image, intended
to replace the old program, while still executing
the old program's runtime library code. Our runtime
loads the new program into a "reserved" child space
never used by fork(), then calls Get to copy that
child's entire memory atop that of the (running) parent:
this Get thus "returns" into the new program. To ensure
that the instruction address following the old program's
Get is a valid place to start the new program, the runtime
places this Get in a small "trampoline" code fragment
mapped at the same location in the old and new
programs. The runtime also carries over some Unix process
state, such as the PID namespace and file system
state described later, from the old to the new program.

Wait: When an application calls waitpid() to wait
for a specific child, the runtime calls Get to synchronize
with the child's Ret and obtain the child's exit status.
(The child may return to the parent before it wishes to
terminate, in order to make I/O requests as described below;
in this case, the parent's runtime services the I/O
request and resumes the waitpid() transparently to
the application.)

Unix's wait() is more challenging, as it violates
principle 3 by waiting for any (i.e., "the first") child to
terminate. Our kernel's API provides no system call to
"wait for any child," and can't (for unprivileged spaces)
without violating its determinism guarantee. Instead, our
runtime waits for the child that was forked earliest whose
status was not yet collected. This behavior does not affect
applications that fork one or more children and then
wait for all of them to complete, but affects two common
uses of wait(). First, interactive Unix shells use
wait() to report when background processes complete;
thus, an interactive shell running under Determinator requires
special "nondeterminism privileges" to provide
this functionality (and related functions such as interactive
job control). Second, our runtime's behavior may
adversely affect the performance of programs that use
wait() to implement dynamic scheduling or load balancing
in user space, which violates principle 5.

Figure 3: Example parallel make scheduling scenarios
under Unix versus Determinator: (a) and (b) with unlimited
parallelism (no user-level scheduling); (c) and (d)
with a "2-worker" quota imposed at user level.

Consider a parallel make run with or without limiting 
the number of concurrent children. A plain 'make -j',
allowing unlimited children, leaves scheduling decisions
to the system. Under Unix or Determinator, the kernel's
scheduler dynamically assigns tasks to available CPUs,
as illustrated in Figure 3(a) and (b). If the user runs
'make -j2', however, then make initially starts only
tasks 1 and 2, then waits for one of them to complete before
starting task 3. Under Unix, wait() returns when
the short task 2 completes, enabling make to start task 3
immediately as in (c). On Determinator, however, the
wait() returns only when (deterministically chosen)
task 1 completes, resulting in a non-optimal schedule (d): 
determinism prevents the runtime from learning which 
of tasks 1 and 2 completed first. This example illustrates 
the importance of separating scheduling from application 
logic, as per principle 5. 
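The effect of the two wait() policies on a worker quota can be reproduced with a short simulation. This is a sketch with made-up numbers: the task durations (4, 1, and 3 time units) are hypothetical and not taken from Figure 3, and the makespan function and its policy names are assumptions introduced for illustration.

```python
def makespan(durations, workers, policy):
    """Simulate `make -jN`: start a new task each time wait() returns.

    policy "first_done":      wait() returns the child that finishes first (Unix).
    policy "earliest_forked": wait() returns the oldest uncollected child
                              (the deterministic emulation described above).
    """
    pending = list(enumerate(durations, start=1))  # (task id, duration), fork order
    running = []                                   # (task id, finish time), fork order
    now = 0
    finish_all = 0
    while pending or running:
        # Fork tasks while under the quota.
        while pending and len(running) < workers:
            task, dur = pending.pop(0)
            running.append((task, now + dur))
            finish_all = max(finish_all, now + dur)
        # wait(): choose which child to collect.
        if policy == "first_done":
            collected = min(running, key=lambda r: r[1])
        elif policy == "earliest_forked":
            collected = running[0]
        else:
            raise ValueError(policy)
        running.remove(collected)
        now = max(now, collected[1])   # wait() blocks until that child has finished
    return finish_all


unix_time = makespan([4, 1, 3], workers=2, policy="first_done")
det_time = makespan([4, 1, 3], workers=2, policy="earliest_forked")
```

With these durations the Unix-style policy finishes at time 4, while the deterministic policy finishes at time 7 because task 3 cannot start until the long task 1 has been collected.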

3.2 A Shared File System 

Unix's globally shared file system provides a convenient 
namespace and repository for staging program inputs, 
storing outputs, and holding intermediate results such as 
temporary files. Since our kernel permits no physical 
state sharing, user-level code must emulate shared state 
abstractions. Determinator's "shared-nothing" space hi- 
erarchy is similar to a distributed system consisting only 
of uniprocessor machines, so our user-level runtime bor- 
rows distributed file system principles to offer applica- 
tions a shared file system abstraction. 
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Figure 4: Each process's user-level runtime maintains an 
individual replica of a logically shared file system, us- 
ing file versioning to reconcile replicas at synchroniza- 
tion points. 

Since our current focus is on emulating familiar ab- 
stractions and not on developing storage systems, Deter- 
minator's file system currently provides no persistence: 
it effectively serves only as a temporary file system. 

While many distributed file system designs may be applicable,
our runtime uses replication with weak consistency
[53, 55]. Our runtime maintains a complete file
system replica in the address space of each process it 
manages, as shown in Figure 4. When a process creates
a child via fork(), the child inherits a copy of
the parent's file system in addition to the parent's open
file descriptors. Individual open/close/read/write
operations in a process use only that process's file system
replica, so different processes' replicas may diverge
as they modify files concurrently. When a child terminates
and its parent collects its state via wait(), the
parent's runtime copies the child's file system image into 
a scratch area in the parent space and uses file version- 
ing WTl to propagate the child's changes into the parent. 

If a shell or parallel make forks several compiler processes
in parallel, for example, each child writes its output
.o file to its own file system replica, then the parent's
runtime merges the resulting .o files into the parent's
file system as the parent collects each child's exit
status. This copying and reconciliation is not as ineffi- 
cient as it may appear, due to the kernel's copy-on-write 
optimizations. Replicating a file system image among 
many spaces copies no physical pages until user-level 
code modifies them, so all processes' copies of identical 
files consume only one set of pages. 

As in any weakly-consistent file system, processes 
may cause conflicts if they perform unsynchronized, con- 
current writes to the same file. When our runtime detects 
a conflict, it simply discards one copy and sets a conflict
flag on the file; subsequent attempts to open() the
file result in errors. This behavior is intended for batch 
compute applications for which conflicts indicate an ap- 
plication or build system bug, whose appropriate solu- 
tion is to fix the bug and re-run the job. Interactive use 
would demand a conflict handling policy that avoids losing
data. The user-level runtime could alternatively use
pessimistic locking to implement stronger consistency 
and avoid unsynchronized concurrent writes, at the cost 
of more inter-space communication. 
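The replicate-then-reconcile scheme, including the conflict flag, can be sketched with per-file version counters. This is an illustration under assumptions: the text says only that the runtime uses file versioning, so the counter scheme, the FileSystemReplica class, and all of its method names are hypothetical, and file deletion is not modeled.

```python
import copy


class FileSystemReplica:
    """One process's private copy of the logically shared file system (toy model)."""

    def __init__(self):
        self.files = {}   # name -> {"version": int, "data": bytes, "conflict": bool}
        self.base = {}    # name -> version this replica had when it was forked

    def write(self, name, data):
        entry = self.files.setdefault(
            name, {"version": 0, "data": b"", "conflict": False})
        entry["version"] += 1
        entry["data"] = data

    def open(self, name):
        entry = self.files[name]
        if entry["conflict"]:
            raise OSError(f"{name}: conflicting concurrent writes")
        return entry["data"]

    def fork(self):
        child = copy.deepcopy(self)
        child.base = {n: e["version"] for n, e in self.files.items()}
        return child

    def reconcile(self, child):
        """Propagate a terminated child's changes into this (parent) replica."""
        for name, theirs in child.files.items():
            fork_version = child.base.get(name, 0)
            if theirs["version"] == fork_version:
                continue                                  # child never wrote this file
            mine = self.files.get(name)
            if mine is None or mine["version"] == fork_version:
                self.files[name] = copy.deepcopy(theirs)  # only the child changed it
            else:
                mine["conflict"] = True                   # both changed: keep ours, flag it


parent = FileSystemReplica()
parent.write("main.c", b"int main(void) { return 0; }")
cc1, cc2 = parent.fork(), parent.fork()
cc1.write("a.o", b"A")
cc2.write("b.o", b"B")
cc1.write("log", b"from cc1")
cc2.write("log", b"from cc2")   # unsynchronized concurrent write to the same file
parent.reconcile(cc1)
parent.reconcile(cc2)
```

After both reconciliations the parent sees a.o and b.o, while log carries the conflict flag and can no longer be opened, matching the behavior described above for batch jobs.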

The current design's placement of each process's file 
system replica in the process's own address space has 
two drawbacks. First, it limits total file system size to 
less than the size of an address space; this is a serious 
limitation in our 32-bit prototype, though it may be less 
of an issue on a 64-bit architecture. Second, wild pointer 
writes in a buggy process may corrupt the file system 
more easily than in Unix, where a buggy process must 
actually call write() to corrupt a file. The runtime
could address the second issue by write-protecting the
file system area between calls to write(), or it could
address both issues by storing file system data in child 
spaces not used for executing child processes. 

3.3 Input/Output and Logging 

Since unprivileged spaces can access external I/O de- 
vices only indirectly via parent/child interaction within 
the space hierarchy, our user-level runtime treats I/O as 
a special case of file system synchronization. In addition 
to regular files, a process's file system image can contain 
special I/O files, such as a console input file and a console 
output file. Unlike Unix device special files, Determina- 
tor's I/O files actually hold data in the process's file sys- 
tem image: for example, a process's console input file 
accumulates all the characters the process has received 
from the console, and its console output file contains all 
the characters it has written to the console. 

When a process does a read() from the console,
the C library first returns unread data already in the pro- 
cess's local console input file. When no more data is 
available, instead of returning an end-of-file condition, 
the process calls Ret to synchronize with its parent and 
wait for more console input (or in principle any other 
form of new input) to become available. When the parent
does a wait() or otherwise synchronizes with the
child, it propagates any new input it already has to the 
child. When the parent has no new input for any waiting 
children, it forwards all their input requests to its parent, 
and ultimately to the kernel via the root process. 

When a process does a console write(), the runtime
appends the new data to its internal console output
file as it would append to a regular file. The next time the 
process synchronizes with its parent, file system recon- 
ciliation propagates these writes toward the root process, 
which forwards them to the kernel's I/O devices. A process
can request immediate synchronization and output
propagation by explicitly calling fsync().

The file system reconciliation mechanism handles 
"append-only" writes differently from other file changes, 
enabling processes to write concurrently to the console
or to log files without conflict. During reconciliation, if
both the parent and child process have made append-only 
writes to the same file, reconciliation appends the child's 
latest writes to the parent's copy of the file, and appends 
the parent's latest writes to the child's copy. Each pro- 
cess's output file thus accumulates all processes' concur- 
rent writes, though different processes may observe these 
writes in a different order. Unlike Unix, rerunning a par- 
allel computation from the same inputs with and without 
output redirection yields byte-for-byte identical console 
and log file output. 

3.4 Shared Memory Multithreading 

Shared memory multithreading is popular despite the 
nondeterminism it introduces into processes, in part be- 
cause parallel code need not pack and unpack messages: 
threads simply compute "in-place" on shared variables 
and structures. Since Determinator gives user spaces no 
physically shared memory other than read-only sharing 
via copy-on-write, emulating shared memory involves 
distributed shared memory (DSM) techniques. 

As with file systems, there are many approaches to 
DSM, but ours builds on release-consistent DSM [3, 17],
which balances efficiency with programming convenience.
Although release consistency normally makes
memory access behavior even less deterministic by re- 
laxing the rules of sequential consistency, we have 
adapted it into a memory model we call deterministic 
consistency (DC), which we detail elsewhere [5]. DC's
roots lie in early parallel Fortran systems [7, 50], in which
all processors make private copies of shared data at the 
beginning of a parallel "for" loop, then read and mod- 
ify only their private "workspaces" within the loop, and 
merge their results once all processors complete. 

DC propagates memory changes between threads 
predictably, only at program-defined synchronization 
points. If one thread executes the assignment 'x = y'
while another concurrently executes 'y = x', for example,
this code yields a nondeterministic data race in standard
memory models, but in DC it is race-free and always
swaps x with y. DC's semantics might simplify simulations
in which threads running in lock-step read and
update large arrays in-place, for example. The absence 
of read/write conflicts in DC also simplifies implementa- 
tion, eliminating the need to execute parallel sequences 
speculatively and risk aborting and wasting effort if a
dependency is detected, as when deterministically emulating
sequential consistency [8, 9, 22].

Our runtime uses the kernel's Snap and Merge opera- 
tions (Section lZST l to emulate shared memory with deter- 
ministic consistency and "fork/join" thread synchroniza- 
tion. To fork a child, the parent thread calls Put with the 
Copy, Snap, Regs, and Start options to copy the shared 
part of its memory into a child space, save a snapshot of 
that memory state in the child, and start the child running,
as illustrated in Figure 5. The master thread may
fork multiple children in parallel this way. To synchronize
with a child and collect its results, the parent calls
Get with the Merge option, which merges all changes the
child made to its shared address space, since the child's
snapshot was taken, back into the parent's space. If both
the parent and child (or the child and other children
whose changes the parent has collected) have concurrently
modified the same shared memory byte since the
snapshot, the kernel detects and reports this write/write
conflict (which is DC's only form of data race).

Figure 5: A multithreaded process built from one space
per thread, with a master space managing synchronization
and memory reconciliation.

Our runtime also supports barriers, the foundation of 
data-parallel programming models like OpenMP lfT2l . 
When each thread in a group arrives at a barrier, it calls 
Ret to stop and wait for the parent thread managing the 
group. The parent calls Get with Merge to collect each 
child's changes before the barrier, then calls Put with 
Copy and Snap to resume each child with a new shared 
memory snapshot containing all threads' prior results. 
While DC conceptually extends to non-hierarchical
synchronization patterns as well [5], such as Lisp-style
futures [34], our kernel's current strict space hierarchy
naturally supports only hierarchical synchronization, a
limitation we intend to address in the future. Any
synchronization abstraction may be emulated at some cost as
described in the next section, however.

An application can choose which parts of its address 
space to share and which to keep thread-private. By plac- 
ing thread stacks outside the shared region, all threads 
can reuse the same stack area, and the kernel wastes no 
effort merging stack data. If threads wish to pass point- 
ers to stack-allocated structures, however, then they may 
locate their stacks in disjoint shared regions. Similarly,
if the file system area is shared, then the threads share a
common file descriptor namespace as in Unix. Excluding
the file system area from shared space and using normal
file system reconciliation (Section 3.2) to synchronize it
yields thread-private file tables.

md5search(unsigned char *hash, int len, int nthreads)
{
    char buf[len+1], output[len+1];
    int done = 0, found = 0, i;

    first_string(&buf, len);
    while (!done && !found) {
        /* fork one child per candidate string */
        for (i = 0; i < nthreads; i++) {
            next_string(&buf, len, &done);
            if (thread_fork(i) == IN_CHILD) {
                check_md5(&buf, hash, &output, &found);
                thread_exit();
            }
        }
        /* collect each child's results */
        for (i = 0; i < nthreads; i++)
            thread_join(i);
    }
}

Figure 6: Pseudocode for parallel "MD5 cracker."

The C pseudocode in Figure 6, a simplified fragment
of a brute-force "MD5 cracking" benchmark
we use later in Section 5, illustrates two convenient
properties of deterministic consistency. First, since
threads can have private stacks in overlapping address
ranges, thread_fork() acts like Unix's process-level
fork(), cloning the parent's stack into the child, so the
program need not separate the child thread's code into
a separate function as pthreads requires. Second, the
parent thread's next_string() call updates buf in-place
before forking each child, whose "work function"
check_md5() refers to this buffer. In a nondeterministic
thread model, this code contains a data race: the
parent may update buf for the next child before the previous
child has finished reading it. Under Determinator,
however, this code is race-free: each child's view of buf
remains as it was when that child was forked, until the
child explicitly calls thread_exit().

3.5 Legacy Synchronization APIs 

Although some synchronization abstractions naturally fit 
a deterministic model, others do not. Mutex locks are 
semantically nondeterministic: they guarantee that
only one thread may own a lock at once, but allow competing
threads to acquire the lock in any order. Condition
variables, semaphores, and message queues allow multi- 
ple threads to race to signal, post, or send, respectively, 
and these events wake up any of several waiting or read- 
ing threads, violating our principle 3. 

For existing sequential code not yet parallelized, 
we hope this code might be parallelized using naturally
deterministic synchronization abstractions like
data-parallel programming models such as OpenMP lfT2ll 
and SHIM [26] provide. For code already parallelized
using nondeterministic synchronization, however, Determinator's
runtime can emulate the standard pthreads API
via deterministic scheduling [8, 9, 22], at certain costs.

In a process that uses nondeterministic synchroniza- 
tion, the process's initial master space never runs ap- 
plication code directly, but instead runs a determinis- 
tic scheduler. This scheduler creates one child space 
to run each application thread. The scheduler runs the 
threads under an artificial execution schedule, emulating 
a schedule by which a true shared-memory multiproces- 
sor might in principle run them, but using a determinis- 
tic, virtual notion of "time" — e.g., number of instructions 
executed — to schedule thread interactions. 

Like DMP [8, 22], our deterministic scheduler quantizes
each thread's execution by preempting it after executing
a fixed number of instructions. Whereas DMP implements
preemption by instrumenting user-level code,
our scheduler uses the kernel's instruction limit feature
(Section 2.3). The scheduler "donates" execution quanta
to threads round-robin, allowing each thread to run concurrently
with other threads for one quantum, before collecting
the thread's shared memory changes via Merge
and restarting it for another quantum.

A thread's shared memory writes propagate to other
threads only at the end of each quantum, violating sequential
consistency [41]. Like DMP-B [8], our deterministic
scheduler implements release consistency [31],
totally ordering only synchronization operations. To enforce
this total order, each synchronization operation
could simply spin for a full quantum. To avoid wasteful
spinning, however, our synchronization primitives interact
with the deterministic scheduler directly.

Each mutex, for example, is always "owned" by some 
thread, whether or not the mutex is locked. The mutex's 
owner can lock and unlock the mutex without scheduler 
interactions, but any other thread needing the mutex must 
first invoke the scheduler to obtain ownership. At the 
current owner's next quantum, the scheduler "steals" the 
mutex from its current owner if the mutex is unlocked, 
and otherwise places the locking thread on the mutex's 
queue to be awoken once the mutex is available. 

Since the scheduler can preempt threads at any 
point, a challenge common to any preemptive sce- 
nario is making synchronization functions such as 
pthread_mutex_lock() atomic. The kernel does
not allow threads to disable or extend their own instruc- 
tion limits, since we wish to use instruction limits at pro- 
cess level as well, e.g., to enforce deterministic "time" 
quotas on untrusted processes, or to improve user-level 
process scheduling (see Section [TT]) by quantizing pro- 
cess execution. After synchronizing with a child thread, 
therefore, the master space checks whether the instruc- 
tion limit preempted a synchronization function, and if 
so, resumes the preempted code in the master space. Before
returning to the application, these functions check
whether they have been "promoted" to the master space, 
and if so migrate their register state back to the child 
thread and restart the scheduler in the master space. 

While deterministic scheduling provides compatibility 
with existing parallel code, it has drawbacks. The master 
space, required to enforce a total order on synchroniza- 
tion operations, may be a scaling bottleneck unless exe- 
cution quanta are large. Since threads can interact only 
at quanta boundaries, however, large quanta increase the 
time one thread may waste waiting for another, to steal 
an unlocked mutex for example. 

Further, since the deterministic scheduler may pre- 
empt a thread and propagate shared memory changes at 
any point in application code, the programming model 
remains nondeterministic. If one thread runs 'x = y'
while another runs 'y = x', the result may be repeatable
but is no more predictable to the programmer than on tra- 
ditional systems — in contrast with the previous section's 
multithreading model. While rerunning a program with 
exactly identical inputs will yield identical results, if the 
input is perturbed to change the length of any instruction 
sequence, these changes may cascade into a different ex- 
ecution schedule and trigger schedule-dependent if not 
timing-dependent heisenbugs. 

4 Prototype Implementation 

Determinator is implemented in C with small assembly 
fragments, runs on the 32-bit x86 architecture, and im- 
plements the kernel API and user-level runtime facilities 
described above. Source code is available on request. 

Since our focus is on parallel compute-bound applications,
Determinator's I/O capabilities are currently limited.
The system provides text-based console I/O and a
Unix-style shell supporting redirection and both scripted 
and interactive use. The shell offers no interactive job 
control, which would require currently unimplemented 
"nondeterministic privileges" (Section FS.lt . The system 
has no demand paging or persistent disk storage: the 
user-level runtime's logically shared file system abstrac- 
tion currently operates in physical memory only. 

The kernel supports application-transparent space migration
among up to 32 machines in a cluster, as described
in Section 2.5. Migration uses a synchronous
messaging protocol with only two request/response types 
and implements almost no optimizations such as page 
prefetching. The protocol runs directly atop Ethernet, 
and is not intended for Internet-wide distribution. 

Implementing instruction limits (Section 2.3) requires
the kernel to recover control after a precise number of 
instructions execute in user mode. While the PA-RISC 
architecture provided this feature lUl, the x86 does not, 
so we borrowed ReVirt's technique [23]. We first set an
imprecise hardware performance counter, which unpredictably
overshoots its target by a small amount, to interrupt
the CPU before the desired number of instructions, then
run the remaining instructions under debug tracing.

5 Evaluation 

This section evaluates the Determinator prototype, first 
informally, then examining single-node and distributed 
parallel processing performance, and finally code size. 

5.1 Experience Using the System 

We find that a deterministic programming model sim- 
plifies debugging of both applications and user-level 
runtime code, since user-space bugs are always repro- 
ducible. Conversely, when we do observe nondetermin- 
istic behavior, it can result only from a kernel (or hard- 
ware) bug, immediately limiting the search space. 

Because Determinator's file system holds a process's
output until the next synchronization event (often the 
process's termination), each process's output appears 
as a unit even if the process executes in parallel with 
other output-generating processes. Further, different pro- 
cesses' outputs appear in a consistent order across runs, 
as if run sequentially. (The kernel provides a system call 
for debugging that outputs a line to the "real" console im- 
mediately, reflecting true execution order, but chaotically 
interleaving output like standard systems.) 

While race detection tools exist [27, 45], we found it
convenient that Determinator detects races all the time 
under "normal-case" execution, without requiring the 
user to run a special tool. Since the kernel detects shared 
memory conflicts and the user-level runtime detects file 
system conflicts at every synchronization event, Deter- 
minator's model makes race detection as standard as de- 
tecting division by zero or illegal memory accesses. 

A subset of Determinator doubles as PIOS, "Paral- 
lel Instructional Operating System," which we used in 
Yale's operating system course this spring. While the 
OS course's objectives did not include determinism, they 
included introducing students to parallel, multicore, and 
distributed operating system concepts. For this purpose, 
we found Determinator/PIOS to be a useful instructional 
tool due to its simple design, minimal kernel API, and 
adoption of distributed systems techniques within and 
across physical machines. PIOS is partly derived from 
MIT's JOS [37], and includes a similar instructional
framework where students fill in missing pieces of a
"skeleton." The twelve students who took the course,
working in groups of two or three, all successfully
reimplemented Determinator's core features: multiprocessor
scheduling with Get/Put/Ret coordination, virtual
memory with copy-on-write and Snap/Merge, user-level
threads with fork/join synchronization (but not deterministic
scheduling), the user-space file system with versioning
and reconciliation, and application-transparent
cross-node distribution via space migration. In their final
projects they extended the OS with features such as
graphics, pipes, and a remote shell. While instructional
use by no means indicates a system's real-world utility,
we find the success of the students in understanding and
building on Determinator's architecture promising.

Figure 7: Determinator performance relative to Linux on
various parallel benchmarks.

5.2 Single-node Multicore Performance 

Since Determinator runs user-level code "natively" on 
the hardware instead of rewriting user code [8, 22], we
expect it to perform comparably to conventional systems 
when executing single-threaded, compute-bound code. 
Since space interactions require system calls, context 
switches, and virtual memory operations, however, we 
expect determinism to incur a performance cost in pro- 
portion to the amount of interaction between spaces. 

Figure 7 shows the performance of several shared-memory
parallel benchmarks we ported, relative to the
same benchmarks running on the 32-bit version of
Ubuntu Linux 9.10. The md5 benchmark searches for
an ASCII string yielding a particular MD5 hash, as in
a brute-force password cracker; matmult multiplies two
1024 × 1024 integer matrices; qsort performs a recursive
parallel quicksort on an integer array; blackscholes is a
financial benchmark from the PARSEC suite [11]; and fft,
lu_cont, and lu_noncont are Fast Fourier Transform and
LU-decomposition benchmarks from SPLASH-2 [55].
We tested all benchmarks on a 2-socket × 6-core, 2.2GHz
AMD Opteron PC.

Coarse-grained benchmarks like md5, matmult, qsort,
blackscholes, and fft show performance comparable with
that of nondeterministic multithreaded execution under
Linux. The md5 benchmark shows better scaling on Determinator
than on Linux, achieving a 2.25× speedup
over Linux on 12 cores. We have not identified the precise
cause of this speedup over Linux but suspect scaling
bottlenecks in Linux's thread system [54].

Porting the blackscholes benchmark to Determinator
required no changes as it uses deterministically scheduled
pthreads (Section 3.5). The deterministic scheduler's
quantization, however, incurs a fixed performance
cost of about 35% for the chosen quantum of 10 million
instructions. We could reduce this overhead by increasing
the quantum, or eliminate it by porting the benchmark
to Determinator's "native" parallel API.

Figure 8: Determinator parallel speedup over single-CPU
performance on various benchmarks.

[Figure 9 plot: x-axis, matrix size (128x128, 256x256,
512x512, 1024x1024); series for 1, 2, 4, 8, and 12 CPUs]

Figure 9: Matrix multiply with varying matrix size.

The fine-grained lu benchmarks show a higher performance
cost, indicating that Determinator's virtual
memory-based approach to enforcing determinism is not
well-suited to fine-grained parallel applications. Future
hardware enhancements might make determinism practical
for fine-grained parallel applications, however [22].

Figure 8 shows each benchmark's speedup relative to
single-threaded execution on Determinator. The "embarrassingly
parallel" md5 and blackscholes scale well, matmult
and fft level off after four processors (but still perform
comparably to Linux as Figure 7 shows), and the
remaining benchmarks scale poorly.

To quantify further the effect of parallel interaction
granularity on deterministic execution performance, Figures
9 and 10 show Linux-relative performance of matmult
and qsort, respectively, for varying problem sizes.
With both benchmarks, deterministic execution incurs a
high performance cost on small problem sizes requiring
frequent interaction, but on large problems Determinator
is competitive with and sometimes faster than Linux.

5.3 Distributed Computing Performance 

While Determinator's rudimentary space migration (Section
2.5) is far from providing a full cluster computing
architecture, we would like to test whether such a
mechanism can extend a deterministic computing model
across nodes with usable performance at least for some
applications. We therefore changed the md5 and matmult
benchmarks to distribute workloads across a cluster
of up to 32 uniprocessor nodes via space migration.
Both benchmarks still run in a (logical) shared memory
model via Snap/Merge. Since we did not have a cluster
on which we could run Determinator natively, we ran
it under QEMU [6], on a cluster of 2-socket × 2-core,
2.4GHz Intel Xeon machines running SuSE Linux 11.1.

Figure 10: Parallel quicksort with varying array size.

Figure 11: MD5 benchmark on varying-size clusters.

Figure 11 shows parallel speedup under Determinator
relative to local single-node execution in the same envi- 
ronment, on a log-log scale. In md5-circuit, the master 
space acts like a traveling salesman, migrating serially to 
each "worker" node to fork child processes, then retrac- 
ing the same circuit to collect their results. The md5-tree 
variation forks workers recursively in a binary tree: the 
master space forks children on two nodes, those children 
each fork two children on two nodes, etc. The matmult- 
tree benchmark implements matrix multiply with recur- 
sive work distribution as in md5-tree. 

The "embarrassingly parallel" md5-tree performs and 
scales well, but only with recursive work distribution. 
Matrix multiply levels off at two nodes, due to the 
amount of matrix data the kernel transfers across nodes 
via its simplistic page copying protocol, which currently 
performs no data streaming, prefetching, or delta com- 
pression. The slowdown for 1-node distributed execution 
in matmult-tree reflects the cost of transferring the matrix 
to a (single) remote machine for processing. 

Figure 12 shows that the shared memory md5-tree
and matmult-tree benchmarks, running on Determinator,
perform comparably to nondeterministic, distributed-memory
equivalents running on Puppy Linux 4.3.1 in
the same QEMU environment. The Determinator version
of md5 is 63% the size of the Linux version (62 lines containing
semicolons versus 99), which uses remote shells
to coordinate workers. The Determinator version of matmult
is 34% the size of its Linux equivalent (90 lines versus
263), which passes data via TCP.

Figure 12: Deterministic, shared-memory MD5 benchmark
compared with a nondeterministic, distributed-memory
Linux implementation.

                          Determinator    PIOS
Kernel core                       2044    1847
Hardware/device drivers            751     647
User-level runtime                2952    1079
Generic C library code            6948     394
User-level programs               1797    1418
Total                           14,492    5385

Table 3: Implementation code size of the Determinator
OS and of PIOS, its instructional subset.

5.4 Implementation Complexity 

To provide a feel for implementation complexity, Table 3
shows source code line counts for Determinator, as well 
as its PIOS instructional subset, counting only lines con- 
taining semicolons. The entire system is less than 15,000 
lines, about half of which is generic C and math library 
code needed mainly for porting Unix applications easily. 

6 Related Work 

The benefits of deterministic programming models are
well-known [13, 43]. Recognizing these benefits, parallel
languages such as SHIM [25, 26] and DPJ [14]
enforce determinism at language level, but cannot
run legacy or multi-process parallel code. Race detectors
[27, 45] can detect heisenbugs in nondeterministic
parallel programs, but may miss heisenbugs resulting
from higher-level order dependencies [3]. Language extensions
can dynamically check determinism assertions
in parallel code [16, 48], but heisenbugs may persist if
the programmer omits an assertion. Only a deterministic 
environment prevents heisenbugs in the first place. 

Application-level deterministic schedulers such as
DMP [22], Grace [9], and CoreDet [8] instrument an application
process to isolate threads' memory accesses,
and run the threads on an artificial, deterministic execution
schedule. DMP and CoreDet isolate threads via
code rewriting, while Grace uses virtual memory techniques
as in Determinator. Since these schedulers run in
the same process as the application itself, bugs or malicious
code can violate determinism by corrupting the
scheduler, as the authors acknowledge. Determinator's
kernel-enforced model ensures repeatability of arbitrary
code in both multithreaded and multi-process computations.
Determinator's user-level runtime also develops
deterministic versions of OS abstractions such as shared
file systems, which lie outside the domain of application-level
deterministic schedulers.

DMP and Grace emulate sequential consistency [41]
by running parallel tasks speculatively, detecting
read/write dependencies between tasks, and re-executing
tasks serially on detecting a dependency. DMP-B [8]
relaxes memory consistency to optimize parallel execution,
but still emulates a nondeterministic programming
model where writes propagate between threads at arbitrary
points unpredictable to the developer. Determinator
combines ideas from early parallel Fortran systems [7, 50]
with release consistency [2, 12, 31, 39] to develop a
"naturally deterministic" programming model Q . In this 
model, read/write conflicts do not exist (only write/write 
conflicts), and shared memory or file changes propa- 
gate among concurrent threads or processes only at ex- 
plicit synchronization points. While focusing on this de- 
terministic programming model, Determinator's runtime 
can emulate nondeterministic models via deterministic 
scheduling to run legacy parallel code. 

Many techniques are available for logging and replaying
nondeterministic events in parallel applications
[21, 28, 42, 46]. SMP-ReVirt can log and replay a multiprocessor
virtual machine [24], supporting uses such as
system-wide intrusion analysis [23, 36] and replay debugging
[40]. Logging a parallel system's nondeterministic
events is costly in performance and storage space,
however, and usually infeasible for "normal-case" ex- 
ecution. Determinator demonstrates the feasibility of 
providing system-enforced determinism for normal-case 
execution, without internal event logging, while main- 
taining performance comparable with current systems at 
least for coarse-grained parallel applications. 

Transactional memory (TM) [35, 51] isolates threads'
writes from each other between transaction start and 
commit/abort. TM offers no deterministic ordering be- 
tween transactions, however: like mutex locks, transac- 
tions guarantee only atomicity, not determinism. 

7 Conclusion 

Determinator is only a first step towards making deter- 
ministic execution readily available and broadly usable 
for normal-case execution of parallel applications. Nev- 
ertheless, our experiments suggest that, with appropri- 
ate kernel and user-level runtime designs, it is possible 
to provide system-enforced deterministic execution efficiently
at least for coarse-grained parallel applications,
both on a single multicore machine and across a cluster